Two metrics on rooted unordered trees with labels

Background: The early development of a zygote can be mathematically described by a developmental tree. To compare developmental trees of different species, we need to define distances on trees. If children cells after a division are not distinguishable, developmental trees are represented by the space $\mathcal{T}$ of rooted trees with possibly repeated labels, where all vertices are unordered. If children cells after a division are partially distinguishable, developmental trees are represented by the space $\mathcal{P}$ of rooted trees with possibly repeated labels, where vertices can be ordered or unordered.

Results: On $\mathcal{T}$, the space of rooted unordered trees with possibly repeated labels, we define two metrics: the best-match metric and the left-regular metric, which show some advantages over existing methods. On $\mathcal{P}$, the space of rooted labeled trees with ordered or unordered vertices, there is no metric, and we define a semimetric, which is a variant of the best-match metric. To compute the best-match distance between two trees, the expected time complexity and worst-case time complexity are both $\mathcal{O}(n^2)$, where $n$ is the tree size. To compute the left-regular distance between two trees, the expected time complexity is $\mathcal{O}(n)$, and the worst-case time complexity is $\mathcal{O}(n\log n)$.

Conclusions: For rooted labeled trees with (fully or partially) unordered vertices, we define metrics (and a semimetric) that can be computed by fast algorithms and have advantages over existing methods. Such trees also appear outside of developmental biology, and such metrics can be applied to other types of trees which have more extensive applications, especially in molecular biology.


Background
In developmental biology, the early development of a zygote is a central topic. For most species, the zygote follows a highly deterministic process. For example, consider a zygote of Arabidopsis thaliana. In stage 1, the zygote divides asymmetrically along the apical-basal axis into two cells. In stage 2, the upper (apical) cell undergoes a symmetric horizontal (meridional) division, and the lower (basal) cell undergoes a vertical (equatorial) division. In stage 3, the upper two cells divide asymmetrically, and the lower two cells undergo symmetric vertical divisions. In stage 4, the upper four cells divide asymmetrically, the middle two cells do not divide, and the lower two cells undergo symmetric vertical divisions [1]. See Fig. 1 for illustrations of this process.
A mathematical representation of the zygote's early development is a developmental tree [2]. In this tree, each vertex represents a cell. Each cell has a label, representing the cell event it will perform, such as division (symmetric or asymmetric, horizontal or vertical), growth, and death. The root vertex is the zygote. Parent vertices (cells) and children vertices (cells) are linked by edges. Each level of this tree corresponds to all the cells at a given stage. See Fig. 2 for the developmental tree of Arabidopsis thaliana.

Wang, Algorithms for Molecular Biology (2022) 17:13. Correspondence: yuew@g.ucla.edu

Zygotes of different species can have different early developments. See Figs. 3 and 4 for the early development of a sea urchin zygote and the corresponding developmental tree [3]. Starting from the zygote, sea urchin and Arabidopsis thaliana are different in division plane and division symmetry, and the cell numbers at stage 4 are already different (16 vs. 14). To quantitatively study the development of different organisms, we need a mathematical method to compare different developmental trees.
When we plot and compare developmental trees, we need to embed them in the plane, namely to consider their planar embeddings. We put the zygote at the top, its two children on the next lower level, and so on. An important question is: after a cell division, which child cell should be put on the left, and which on the right? In some situations, we cannot distinguish two children cells, and we can arbitrarily switch the positions of these two children cells in the planar embedding. See Fig. 5 for equivalent planar embeddings of the same tree. Notice that when we switch cells in the planar embedding, the corresponding cell events are also switched. In some situations, we can distinguish two children cells from an asymmetric division or by which cell inherits the mother centriole [4]. Then we can set a rule to determine which child cell is the left child in the planar embedding, and we cannot switch these two children cells. We start from the easier situation that we cannot distinguish children cells, so that in the planar embedding of the developmental tree, we can switch two subtrees for each vertex.

Fig. 1 Early development of an Arabidopsis thaliana zygote [1]. Each unit is a cell. A green line between two cells means these two cells were just generated by a symmetric horizontal division. A blue line between two cells means these two cells were just generated by a symmetric vertical division. A red line between two cells means these two cells were just generated by an asymmetric division. An orange circle in a cell means this cell did not divide during the last stage.

Fig. 2 The developmental tree of Arabidopsis thaliana, $T_A$, corresponding to Fig. 1. Each vertex represents a cell, and its label represents the cell event it performs. Label X means symmetric horizontal division; label Z means symmetric vertical division; label W means asymmetric division; label S means the cell stays still and does not divide.
Notice that a developmental tree has the zygote as its root, and different vertices can have the same label (cell event). The goal is to compare developmental trees.
In the language of graph theory, we need to define a metric on the space of rooted unordered trees with possibly repeated labels. Each tree has a root vertex, and each vertex has a label that is not necessarily unique. All vertices are unordered, meaning that we can switch left and right children in the planar embedding of each tree. Vertices and their labels are always associated, so that we do not distinguish a vertex and the label of a vertex. Therefore, when switching vertices, their labels are also switched. Such trees are not limited to developmental biology, but can be applied in various fields.
There are many metrics defined on trees, which can be roughly classified into three groups by their ideas: (1) Calculate the minimal operations needed to transform one tree into another, such as rearrangement distance [5], tree edit distance [6], edge rotation distance [7], and geodesic distance [8]. (2) Find the largest common structure of two trees, such as bottom-up distance [9] and subtree distance [10]. (3) [...] [11], matching cluster distance [12], and triples distance [13]. However, many existing methods have specific requirements on trees, so that they are not applicable in our case (rooted unordered trees with possibly repeated labels). Some methods require that different vertices have different labels, and different trees have the same label set [5, 7]. Some methods work for phylogenetic trees: only leaf vertices have labels; different vertices have different labels; different trees have the same label set [8, 11-13]. Some methods require that the trees are ordered [6].

Fig. 3 Early development of a sea urchin zygote [3]. Each unit is a cell. A green line between two cells means these two cells were just generated by a symmetric horizontal division. A red line between two cells means these two cells were just generated by an asymmetric division.
Among existing methods, the bottom-up distance [9] and the subtree distance [10] could work on rooted unordered trees with possibly repeated labels. The bottom-up distance between two trees $T_1, T_2$ is defined as $D_{BU}(T_1, T_2) = 1 - f / \max(n_1, n_2)$, where $n_1, n_2$ are the tree sizes, and $f$ is the size of the largest common forest of two trees. The subtree distance $D_{ST}(T_1, T_2)$ is defined almost the same as the bottom-up distance, except that $f$ is the size of the largest common subtree of two trees. Both distances could be calculated in linear time [9, 14]. These two methods have some disadvantages. For example, they are not robust under small perturbations on labels, and they do not compare non-common structures. See the next section for detailed discussions.
We develop two new metrics that apply to rooted unordered trees with possibly repeated labels: the best-match metric $D_{BM}$ and the left-regular metric $D_{LR}$. For two unordered trees, the best-match metric searches all their planar embeddings, and compares the most similar pair. To calculate the left-regular metric for two unordered trees, we apply a procedure to fix one planar embedding for each unordered tree (its "regular form"), and compare the regular forms of these two unordered trees. These two metrics take into account different similarities between labels and different weights concerning their positions. These two metrics, especially the best-match metric, consider any common structures and compare non-common structures. To compute the best-match distance between two trees (binary or general $k$-ary), the expected time complexity and the worst-case time complexity are both $\mathcal{O}(n^2)$, where $n$ is the tree size. To compute the left-regular distance between two trees (binary or not), the expected time complexity is $\mathcal{O}(n)$, and the worst-case time complexity is $\mathcal{O}(n \log n)$.
The above discussions are for unordered trees, where all vertices are unordered. In some cases, we can distinguish two children cells, so that certain vertices are ordered. Then the space we need to consider consists of rooted trees with possibly repeated labels, where vertices can be ordered or unordered. This larger space has complicated structures that do not allow the existence of a proper metric. Existing methods and the left-regular metric introduced in this paper are not applicable. Nevertheless, the best-match metric can be slightly modified to become a semimetric that works in this scenario.
The main text consists of the following contents: compare existing methods and our new methods; introduce related terminologies in graph theory; define two metrics on the space of rooted unordered trees with possibly repeated labels; define a semimetric on the space of rooted trees with possibly repeated labels, where vertices can be ordered or unordered.

Comparison of existing methods and new methods
In this section, we compare the performance of existing methods and new methods on rooted unordered trees with possibly repeated labels, so as to explain the motivation to develop new methods. The examples used are illustrated in Figs. 6, 7 and 8. See Table 1 for a summary of these comparisons.
Compared to the left-regular metric $D_{LR}$, and especially to the best-match metric $D_{BM}$ introduced in this paper, the bottom-up distance $D_{BU}$ [9] and the subtree distance $D_{ST}$ [10] have some disadvantages.

In Fig. 6, $T_1, T_2$ have the same distribution of leaf labels, while $T_1, T_3$ have different distributions of leaf labels. However, [...]. The reason is that $D_{BU}$ and $D_{ST}$ only consider common structures, but not their detailed patterns. $D_{BM}$ and $D_{LR}$ can recognize the difference: [...].

In Fig. 7, $T_4, T_5$ have the same tree topology, while $T_4, T_6$ have different tree topologies. However, [...]. The reason is that $D_{BU}$ and $D_{ST}$ do not compare non-common structures. $D_{BM}$ and $D_{LR}$ can recognize the difference: [...].

In Fig. 8, $T_7, T_8$ only differ by a leaf label, while $T_7, T_9$ are much more different. However, [...]. The reason is that $D_{BU}$ and $D_{ST}$ only consider certain common structures (sub-forest and subtree). $D_{BM}$ and $D_{LR}$ consider any common structures and recognize that $T_7, T_8$ are more similar: [...].

Besides, for two vertices with different labels, $D_{BU}$ and $D_{ST}$ only know they are different, but not how different they are. In reality, such as in comparing developmental trees, some labels are very different, while some labels are rather similar. The position of a vertex can also matter. In general, a label difference closer to the root should be more crucial. In $D_{BM}$ and $D_{LR}$, different distances between labels and different weights on vertices can be introduced naturally.
The above discussion explains our motivation to develop the best-match metric $D_{BM}$ and the left-regular metric $D_{LR}$. However, $D_{BM}$ and $D_{LR}$ also have disadvantages.
In Fig. 8, $T_7, T_{10}$ only differ by a leaf label. In this case, [...]. The reason is that $D_{LR}$ is not always robust under small perturbations on labels, similar to $D_{BU}$ and $D_{ST}$. $D_{BM}$ is robust under small perturbations on labels.
In Fig. 8, inserting one vertex into $T_7$ produces $T_{11}$. In this case, $D_{BU}(T_7, T_{11}) = 1/7$ and $D_{ST}(T_7, T_{11}) = 1/7$, but $D_{BM}(T_7, T_{11}) = 9$ and $D_{LR}(T_7, T_{11}) = 9$. The reason is that $D_{BM}$ and $D_{LR}$ are not robust under small perturbations on the tree topology, especially perturbations near the roots. $D_{BU}$ and $D_{ST}$ are more robust to the change of tree topology near the roots.
In summary, our methods outperform the existing methods in most of the comparisons above (see Table 1). In general, we recommend the best-match metric $D_{BM}$. If time cost is a major concern, the left-regular metric $D_{LR}$ can be applied.

Trees
In graph theory, a rooted tree is a connected acyclic undirected graph, where one vertex $v_0$ is designated as the root. Some vertices are linked by edges. For each vertex $v_i$, there is a unique path (edge sequence) that connects $v_i$ and the root $v_0$. The number of edges in this path is called the depth of $v_i$. The depth of the root $v_0$ is stipulated as 0. The depth of a tree is the largest depth of its vertices. The $k$th level (or level $k$) of a tree consists of all vertices whose depths are $k$. If the depth of a tree is $m$, it is also called an $m$-level tree. If there is an edge between two vertices $v_i, v_j$, and the depth of $v_i$ is smaller than the depth of $v_j$, then $v_i$ is the parent vertex of $v_j$, and $v_j$ is a child vertex of $v_i$. For $v_i$ and its child vertex $v_j$, the tree with root $v_j$ is called a subtree of $v_i$. A vertex without children vertices is called a leaf vertex [15].
In this paper, each vertex has a label, and different vertices might have the same label. The set of possible labels $\mathcal{L}$ can have infinitely many or even uncountably many elements. In the following, we use $\mathcal{L} = \{X, Y, Z\}$ as an example. For simplicity, we only consider binary trees, meaning that each vertex has at most two children vertices. However, the methods in this paper also work for general $k$-ary trees.
For an $l$-level tree $T$ and any $m \ge l$, we construct its level-$m$ completion $T(m)$ as follows: For a vertex not in level $m$, if it has fewer than two children vertices, add children vertices to it until it has two. Newly added vertices have the label "N" (meaning "null"). Repeat this procedure until every vertex not in level $m$ has two children vertices, and every vertex in level $m$ has no children vertices. In other words, we construct a perfect binary $m$-level tree. See Figs. 9 and 10 for two trees and their completions with different levels. For trees after completion, the label set is extended to $\mathcal{L} \cup \{N\}$, which is $\{X, Y, Z, N\}$ in our examples. For now, we just require that there is a metric $d$ on this extended label set. In this paper, for simplicity, we shall apply the trivial metric that different labels always have distance 1. Later, we will also need a total order on the extended label set.
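The completion procedure above can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not part of the paper: a binary tree is written as a nested tuple `(label, left, right)` with `None` for a missing child, and the completed tree is stored as a flat list in root-to-leaf, left-to-right order, so that the children of position `i` sit at positions `2*i + 1` and `2*i + 2`. The names `depth` and `complete` are hypothetical.

```python
def depth(tree):
    """Depth of a tree given as (label, left, right); a single vertex has depth 0."""
    if tree is None:
        return -1
    _, left, right = tree
    return 1 + max(depth(left), depth(right))


def complete(tree, m, null="N"):
    """Return the labels of the level-m completion, root to leaf, left to right."""
    if m < depth(tree):
        raise ValueError("m must be no less than the depth of the tree")
    labels = [null] * (2 ** (m + 1) - 1)  # perfect binary tree, all "N" at first

    def fill(node, index):
        if node is None:
            return  # this position stays a null vertex
        labels[index] = node[0]
        fill(node[1], 2 * index + 1)  # left child
        fill(node[2], 2 * index + 2)  # right child

    fill(tree, 0)
    return labels


# Assumed example: a 2-level tree with root X, a leaf Y, and a vertex Z with two leaves.
t = ("X", ("Y", None, None), ("Z", ("Y", None, None), ("Z", None, None)))
print("".join(complete(t, 2)))  # XYZNNYZ
print("".join(complete(t, 3)))  # XYZNNYZNNNNNNNN
```

The flat list has exactly $2^{m+1} - 1$ entries, which is the size of a perfect binary $m$-level tree.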
A vertex is called ordered if in the planar embedding of this tree, we know which of its child vertices is the left child, and which is the right child. Otherwise, it is called unordered, and we can switch its two subtrees in the planar embedding. A tree is ordered if all its vertices are ordered. A tree is unordered if all its vertices are unordered.
Each ordered tree corresponds to a unique planar embedding. In the following, we do not distinguish an ordered tree and its planar embedding. For the space of rooted ordered trees with possibly repeated labels, we define that two trees are equivalent if one tree can be transformed into the other by switching subtrees of some vertices (labels are also switched along with the vertices). Here after transformations, two trees have the same tree topology, and corresponding vertices have the same label. The notation $T_1 \sim T_2$ means $T_1, T_2$ are equivalent, and $T_1 \not\sim T_2$ means $T_1, T_2$ are not equivalent. With this equivalence relation, the space of ordered trees is divided into different equivalence classes. See Fig. 5 for an equivalence class of ordered trees, where four ordered trees are equivalent.

An unordered tree corresponds to different planar embeddings (ordered trees). Since we can switch two subtrees of an unordered vertex, equivalent ordered trees represent the same unordered tree. Besides, non-equivalent ordered trees represent different unordered trees. Therefore, the space of unordered trees is isomorphic to the space of equivalence classes of ordered trees. The four ordered trees in Fig. 5 represent the same unordered tree.

Metrics
To define a metric on unordered trees, we can switch to equivalence classes of ordered trees. A metric $D$ on the space of equivalence classes of ordered trees maps a pair of such trees to a non-negative real number, and it satisfies the following criteria for any trees $T_1, T_2, T_3$:

(A1) $D(T_1, T_2) = 0$ if and only if $T_1 \sim T_2$;
(A2) $D(T_1, T_2) = D(T_2, T_1)$;
(A3) $D(T_1, T_3) \le D(T_1, T_2) + D(T_2, T_3)$.

A metric that satisfies (A1)-(A3) also has another property: if $T_1 \sim T_2$, then $D(T_1, T_3) = D(T_2, T_3)$ for any $T_3$.

Before introducing metrics on unordered trees, we first need a metric on the space of ordered trees (not equivalence classes). For two ordered trees $T_1$ and $T_2$, consider their level-$m$ completions, where $m$ is no less than the depths of $T_1$ and $T_2$. For these two completed $m$-level trees $T_1(m), T_2(m)$ with the same tree topology, there is a bijection between vertices. We define the ordered tree metric $D_{OT}(T_1, T_2)$ for such completed ordered trees:

$$D_{OT}(T_1, T_2) = \sum_{i \in T_1(m)} c(i)\, d(i, i'),$$

where $i' \in T_2(m)$ is the corresponding vertex of $i$, $d$ is the metric on the (extended) label set, and $c(i)$ is the weight coefficient that depends on the depth of $i$. In some scenarios, we want to emphasize the differences closer to the root (corresponding to earlier developmental stages), meaning that we can assign a larger value to $c(i)$ when the depth of $i$ is smaller. For simplicity, we use $c(i) = 1$ for all vertices in this paper. We can see that the value of $D_{OT}$ does not depend on the choice of $m$, since the vertices added by a deeper completion carry the label N in both trees and contribute 0. For tree $T_{12}$ in Fig. 9 and tree $T_{13}$ in Fig. 10, their $D_{OT}$ distance is $D_{OT}(T_{12}, T_{13}) = 4$, since they have 4 pairs of corresponding vertices with different labels.

In the rest of this paper, we always consider trees after completion of proper levels. Therefore, the number of vertices (tree size) $n$ and the depth $m$ satisfy $n = 2^{m+1} - 1$.
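As a small illustration of the ordered tree metric, the following Python sketch sums weighted label distances over corresponding vertices. It assumes (this is not from the paper) that each completed tree is given as a label string in root-to-leaf, left-to-right order, so that position `i` has depth $\lfloor \log_2 (i+1) \rfloor$. The names `depth_of`, `trivial_metric` and `d_ot`, and the two example strings, are hypothetical.

```python
def depth_of(index):
    """Depth of the vertex at position `index` of a label string (the root is 0)."""
    return (index + 1).bit_length() - 1


def trivial_metric(x, y):
    """Different labels always have distance 1."""
    return 0 if x == y else 1


def d_ot(a, b, d=trivial_metric, c=lambda depth: 1):
    """Ordered tree distance between two trees completed to the same level."""
    if len(a) != len(b):
        raise ValueError("complete both trees to the same level first")
    return sum(c(depth_of(i)) * d(x, y) for i, (x, y) in enumerate(zip(a, b)))


# Two assumed 2-level trees that differ at one level-1 vertex and one leaf.
print(d_ot("XYZNNYZ", "XZZNNYY"))                           # 2
# Weights that halve with each level emphasize differences near the root.
print(d_ot("XYZNNYZ", "XZZNNYY", c=lambda k: 2 ** -k))      # 0.75
# A deeper completion only appends N labels to both trees, so the value is unchanged.
print(d_ot("XYZNNYZ" + "N" * 8, "XZZNNYY" + "N" * 8))       # 2
```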

Definition
We start to define metrics on the space $\mathcal{T}$ of unordered trees, namely the equivalence classes of ordered trees. For two ordered trees $T_1, T_2$ (representing their equivalence classes), we can check all pairs of ordered trees in which one is equivalent to $T_1$ and the other is equivalent to $T_2$, and choose the best-match pair with the minimal $D_{OT}$ distance. We define $D_{BM}$ on equivalence classes of ordered trees:

$$D_{BM}(T_1, T_2) = \min_{T_1' \sim T_1,\ T_2' \sim T_2} D_{OT}(T_1', T_2').$$

This $D_{BM}(T_1, T_2)$ satisfies the criteria (A1)-(A3) for a metric, defined in the previous section. We name $D_{BM}$ the best-match metric. For the tree $T_{12}$ in Fig. 9 and the tree $T_{13}$ in Fig. 10, $D_{BM}(T_{12}, T_{13}) = 4$.
From the definition of the best-match metric $D_{BM}(T_1, T_2)$, we can see that changing one label of $T_1$ will make $D_{BM}(T_1, T_2)$ change by at most 1 (under the trivial label metric and $c(i) = 1$). Therefore, the best-match metric is robust under small perturbations on labels. This property does not hold for the left-regular metric, the bottom-up distance, and the subtree distance.

A dynamic programming implementation
There are exponentially many trees equivalent to a given tree. Thus brute-force searching is too expensive. Here we introduce a dynamic programming algorithm [16] for calculating the best-match metric $D_{BM}(T_a, T_b)$. See Algorithm 1 for the workflow of calculating the best-match metric $D_{BM}$. The idea is simple: For the root, we only need to determine whether the left and right subtrees should be switched. In either case, the problem is reduced to minimizing the distance between subtrees. In other words, the vertex correspondence that minimizes the distance between two trees also minimizes the distance between two subtrees.
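The recursive idea described above can be sketched as follows. This is an illustrative sketch, not a transcription of Algorithm 1. It assumes that both trees are already completed to the same level and given as label strings in root-to-leaf, left-to-right order (children of position `i` at `2*i + 1` and `2*i + 2`); the name `d_bm` and the example strings are hypothetical.

```python
def d_bm(a, b, d=lambda x, y: 0 if x == y else 1, c=lambda depth: 1):
    """Best-match distance between two completed binary trees (label strings)."""
    if len(a) != len(b):
        raise ValueError("complete both trees to the same level first")
    n = len(a)

    def best(i, j):
        # i and j are vertices of a and b at the same depth.
        depth = (i + 1).bit_length() - 1
        cost = c(depth) * d(a[i], b[j])
        if 2 * i + 1 >= n:
            return cost  # leaf level: nothing left to switch
        li, ri, lj, rj = 2 * i + 1, 2 * i + 2, 2 * j + 1, 2 * j + 2
        keep = best(li, lj) + best(ri, rj)    # match left with left, right with right
        switch = best(li, rj) + best(ri, lj)  # match left with right, right with left
        return cost + min(keep, switch)

    return best(0, 0)


# Switching the two subtrees of the root gives an equivalent tree, so the distance is 0.
print(d_bm("XYZNNYZ", "XZYYZNN"))  # 0
print(d_bm("XYZNNYZ", "XZZNNYY"))  # 2
```

In this sketch every pair of same-depth vertices $(i, j)$ is visited exactly once, so the number of calls is $\sum_{l=0}^{m} 4^l$, which is of order $n^2$ for $n = 2^{m+1} - 1$ and agrees with the stated $\mathcal{O}(n^2)$ complexity.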
In the appendix, we illustrate the detailed procedure of calculating $D_{BM}$ for the developmental trees of Arabidopsis thaliana and sea urchin. $D_{BM}$ is also applied to other developmental trees with tree size around 100, and it is discovered that species with similar developmental trees (i.e., smaller $D_{BM}$) are more likely to have the same anatomical traits [17].

Left-regular metric on unordered trees

Preparation
Since the metric is defined on the equivalence classes of ordered trees, we need to guarantee that equivalent trees have the same behavior, namely $D(T_1, T_3) = D(T_2, T_3)$ if $T_1 \sim T_2$. One idea is to transform a given tree into some "regular form", which is unique to each equivalence class. We define a total order on the (extended) label set, such as $N > X > Y > Z$. Ideally, similar labels should be closer in this order. With this total order on the label set (alphabet), there is an induced total order, namely the lexicographic order [18], for strings of labels with the same length: for two strings, compare the corresponding labels from the beginning until there is a difference, and apply the total order for labels. For example, $XZN < XNY$, since $X = X$ and $Z < N$.

For a tree after completion, we can write its labels as a string, in the order of top to bottom (root to leaf) and left to right. This is named its label string. For example, the label string of $T_{12}(2)$ in Fig. 11 is XYZNNYZ. We can reconstruct a tree from its label string.

Now we describe the procedure of left-regularization, through which a tree is transformed into its "regular form". Consider a tree $T$ after level-$l$ completion. For each vertex in level $l - 1$, if the label string of its left subtree is larger than the label string of its right subtree, switch its left and right subtrees. This procedure is called "left-regularization". After the left-regularization of level $l - 1$, repeat this procedure for levels $l - 2, l - 3, \ldots, 1, 0$. When the procedure is finished, we obtain the fully "left-regularized" form of $T$. The procedure of left-regularization for the tree $T_{12}(2)$ is shown in Fig. 11.
In a fully left-regularized tree, for each vertex, the label string of its left subtree is no larger than that of its right subtree. Thus each subtree is also fully left-regularized. By induction on the tree depth, we can see that two equivalent ordered trees have the same left-regularization. Two trees with the same left-regularization are obviously equivalent. Therefore, two ordered trees are equivalent if and only if their left-regularizations are the same. With this procedure, each unordered tree (or its corresponding equivalence class of ordered trees) corresponds to a unique left-regularized ordered tree.
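The left-regularization and the resulting equivalence test can be sketched in Python as below. The sketch makes assumptions that are not from the paper: a completed tree is given as its label string, the total order is passed as a string listing labels from smallest to largest (so `"ZYXN"` encodes $Z < Y < X < N$), and the function names are hypothetical. It regularizes subtrees recursively from the bottom up, which applies the same switches as processing the levels from the deepest to the root, and it always compares full label strings rather than stopping at the first difference.

```python
def left_regularize(labels, order):
    """Return the labels of the fully left-regularized form, as a list."""
    rank = {label: k for k, label in enumerate(order)}  # position in the total order
    n = len(labels)

    def key(levels):
        # Label string of a subtree (levels top to bottom), mapped to ranks.
        return [rank[x] for level in levels for x in level]

    def regular(i):
        # Regularized subtree rooted at position i, stored level by level.
        if 2 * i + 1 >= n:
            return [[labels[i]]]
        left, right = regular(2 * i + 1), regular(2 * i + 2)
        if key(left) > key(right):  # lexicographic comparison of label strings
            left, right = right, left
        return [[labels[i]]] + [l + r for l, r in zip(left, right)]

    return [x for level in regular(0) for x in level]


def equivalent(a, b, order):
    """Two completed ordered trees are equivalent iff their regular forms agree."""
    return len(a) == len(b) and left_regularize(a, order) == left_regularize(b, order)


order = "ZYXN"  # Z < Y < X < N
print("".join(left_regularize("XYZNNYZ", order)))
print(equivalent("XYZNNYZ", "XZYYZNN", order))  # True: root subtrees switched
print(equivalent("XYZNNYZ", "XZZNNYY", order))  # False
```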

Definition and properties
For a tree $T$, denote its fully left-regularized form as $\overline{T}$. The left-regular metric compares the regular forms of two trees:

$$D_{LR}(T_1, T_2) = D_{OT}(\overline{T_1}, \overline{T_2}).$$

The tree $T_{13}(2)$ in Fig. 10 is already left-regularized. Thus we can compare it with $T_{12}(2)$ after left-regularization in Fig. 11 to find $D_{LR}(T_{12}, T_{13}) = 5$. In the appendix, we illustrate the detailed procedure of calculating $D_{LR}$ for the developmental trees of Arabidopsis thaliana and sea urchin. For more examples, see Fig. 6.

Consider an $m$-level tree. For each vertex in level $l$, to compare the label strings of its subtrees, we need at most $2^{m-l}$ steps. Therefore, the left-regularization on each level needs at most $2^m$ steps, and the total number of steps is no more than $(m + 1)2^m$. Thus the worst-case time complexity of computing the left-regular metric is $\mathcal{O}(n \log n)$, where $n$ is the vertex number. The space complexity of computing the left-regular metric is trivially $\mathcal{O}(n)$. If the trees are randomly generated, then the expectation of steps needed to compare two label strings is bounded by a constant $C$, regardless of string length. Thus the expected total number of steps is no more than $C\,2^{m+1}$, and the expected time complexity is $\mathcal{O}(n)$. When the trees are not binary, but $k$-ary, the orders of the worst-case time complexity and the expected time complexity are not changed.

Both the best-match metric and the left-regular metric transform two trees by switching subtrees, and compare the trees after transformation. The best-match metric switches subtrees for two trees cooperatively, so as to find the pair that has the minimal $D_{OT}$ distance. The left-regular metric just switches subtrees independently, and the final pair might not be the best match. Thus for any two unordered trees $T_1, T_2$, $D_{LR}(T_1, T_2) \ge D_{BM}(T_1, T_2)$, namely $D_{LR}$ is an upper bound of $D_{BM}$.
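The step counts in the complexity argument for the left-regular metric can be written out explicitly. The following is only a restatement of that argument for a perfect binary $m$-level tree with $n = 2^{m+1} - 1$ vertices, using the same bound of $2^{m-l}$ steps per comparison at level $l$ and the same constant $C$ for the expected cost of one comparison.

```latex
\begin{align*}
\text{steps on level } l
  &\le \underbrace{2^{l}}_{\text{vertices in level } l}
       \cdot \underbrace{2^{m-l}}_{\text{steps per comparison}} = 2^{m}, \\
\text{total steps (worst case)}
  &\le \sum_{l=0}^{m} 2^{m} = (m+1)\,2^{m}
   = \frac{(n+1)\log_2(n+1)}{2} = \mathcal{O}(n \log n), \\
\text{expected total steps}
  &\le C \cdot \underbrace{2^{m+1}}_{\ge\ \text{number of comparisons}}
   = C\,(n+1) = \mathcal{O}(n).
\end{align*}
```

The second line uses $2^{m} = (n+1)/2$ and $m + 1 = \log_2(n+1)$, both of which follow from $n = 2^{m+1} - 1$.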

Best-match metric $D_{BM}$
The procedure is recursive. We need to determine the correspondence of subtrees rooted in level 1, which depends on the correspondence of subtrees rooted in level 2, which then depends on the correspondence of subtrees rooted in level 3.
Step 1:
Step 2.1:
Step 3.1.1 to Step 3.1.4 (the same procedure)
Back to Step 2.1
Step 2.2:
Step 3.2.1 and Step 3.2.3 (the same procedure)
Step 3.
Back to Step 2.2
Step 2.3:
Step 3.
Back to Step 2.3
Step 2.4:
Step 3.4.1 and Step 3.4.3 (the same procedure)
Step 3.
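The recursion outlined above can be followed with a short Python sketch that also records where subtrees are switched. It is an illustration under stated assumptions, not the author's implementation: the trivial label metric and unit weights are used, the two completed trees are given by the label strings WXZWWZZWWWWSSZZ (Arabidopsis thaliana, $T_A$) and XXXWWWWXXXXWWWW (sea urchin, $T_S$) in root-to-leaf, left-to-right order, and the name `best_match` is hypothetical.

```python
def best_match(a, b, i=0, j=0):
    """Return (distance, switched) for the subtrees of a at i and of b at j.

    `switched` lists the positions in `a` whose two subtrees are switched
    to reach the best match; ties are resolved in favor of not switching.
    """
    cost = 0 if a[i] == b[j] else 1
    li, ri, lj, rj = 2 * i + 1, 2 * i + 2, 2 * j + 1, 2 * j + 2
    if li >= len(a):
        return cost, []
    keep_l, keep_r = best_match(a, b, li, lj), best_match(a, b, ri, rj)
    cross_l, cross_r = best_match(a, b, li, rj), best_match(a, b, ri, lj)
    keep = keep_l[0] + keep_r[0]
    cross = cross_l[0] + cross_r[0]
    if cross < keep:
        return cost + cross, [i] + cross_l[1] + cross_r[1]
    return cost + keep, keep_l[1] + keep_r[1]


T_A = "WXZWWZZWWWWSSZZ"
T_S = "XXXWWWWXXXXWWWW"

# The four level-1 subproblems that the decision at the root depends on.
for i in (1, 2):
    for j in (1, 2):
        print(f"subtree {i} of T_A vs subtree {j} of T_S:", best_match(T_A, T_S, i, j)[0])

distance, switched = best_match(T_A, T_S)
print(distance, switched)
```

Under these assumptions the four level-1 subproblems evaluate to 4, 0, 7 and 7, so at the root the crosswise match (0 + 7) beats the direct match (4 + 7), and the sketch returns a distance of 8 with only the root of $T_A$ switched.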

Left-regular metric $D_{LR}$
We use the total order $Z < X < W < S$ on the label set. We apply the left-regularization from level 2 to level 0. Left-regularization on level 2: For each vertex in level 2 of $T_A$ and $T_S$, its left subtree and right subtree have the same label string, and we do not need to switch these subtrees. After this step, the label string of $T_A$ is WXZWWZZWWWWSSZZ, and the label string of $T_S$ is XXXWWWWXXXXWWWW.
Left-regularization on level 1: For the vertex with label "Z" in level 1 of $T_A$, its left subtree label string is "ZSS", which is larger than that of its right subtree, "ZZZ". Thus we switch the two subtrees of this vertex. For the other three vertices in level 1 of $T_A$ and $T_S$, the left subtree and right subtree have the same label string, and we do not need to switch these subtrees. After this step, the label string of $T_A$ is WXZWWZZWWWWZZSS, and the label string of $T_S$ is XXXWWWWXXXXWWWW.
Left-regularization on level 0: For the root vertex (level 0) of $T_A$, its left subtree label string is "XWWWWWW", which is larger than that of its right subtree, "ZZZZZSS". Thus we switch the two subtrees of this vertex. For the root vertex of $T_S$, its left subtree label string is "XWWXXXX", which is smaller than that of its right subtree, "XWWWWWW". Thus we do not need to switch these subtrees. After this step, the label string of $T_A$ is WZXZZWWZZSSWWWW, and the label string of $T_S$ is XXXWWWWXXXXWWWW.
The left-regularization results of $T_A$ and $T_S$ are in Fig. 13. We can calculate the $D_{OT}$ metric for these two trees. Since there are eight pairs of corresponding vertices with different labels, we have $D_{LR}(T_A, T_S) = 8$.
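The level-by-level steps above can be replayed with the following Python sketch, which prints the label string after each level and then counts the differing positions. It is a minimal illustration, assuming the label strings quoted in this appendix, the order $Z < X < W < S$ written as the string `"ZXWS"`, and a flat root-to-leaf, left-to-right layout; the function names are hypothetical.

```python
def subtree_positions(i, n):
    """Positions of the subtree rooted at i, top to bottom, left to right."""
    positions, first, width = [], i, 1
    while first < n:
        positions.extend(range(first, first + width))
        first, width = 2 * first + 1, 2 * width
    return positions


def left_regularize_by_level(labels, order):
    """Yield (level, label string) after left-regularizing each level, deepest first."""
    rank = {label: k for k, label in enumerate(order)}
    labels = list(labels)
    n = len(labels)
    m = (n + 1).bit_length() - 2  # depth of the perfect binary tree
    for level in range(m - 1, -1, -1):
        for i in range(2 ** level - 1, 2 ** (level + 1) - 1):
            left = subtree_positions(2 * i + 1, n)
            right = subtree_positions(2 * i + 2, n)
            if [rank[labels[p]] for p in left] > [rank[labels[p]] for p in right]:
                for p, q in zip(left, right):  # switch the two subtrees
                    labels[p], labels[q] = labels[q], labels[p]
        yield level, "".join(labels)


order = "ZXWS"  # Z < X < W < S
final = {}
for name, tree in (("T_A", "WXZWWZZWWWWSSZZ"), ("T_S", "XXXWWWWXXXXWWWW")):
    for level, string in left_regularize_by_level(tree, order):
        print(name, "after level", level, string)
    final[name] = string

d_lr = sum(x != y for x, y in zip(final["T_A"], final["T_S"]))
print("D_LR(T_A, T_S) =", d_lr)  # 8
```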